Binomial-discrete Erlang-truncated exponential mixture and its application in cancer disease

Among diseases, cancer exhibits the fastest global spread, presenting a substantial challenge for patients, their families, and the communities they belong to. This paper is devoted to modeling such a disease as a special case. A newly proposed distribution called the binomial-discrete Erlang-truncated exponential (BDETE) is introduced. The BDETE is a mixture of the binomial distribution in which the number of trials (parameter n) follows a discrete Erlang-truncated exponential distribution. A comprehensive mathematical treatment of the proposed distribution is provided, including expressions for its density, cumulative distribution function, survival function, failure rate function, quantile function, moment generating function, Shannon entropy, order statistics, and stress-strength reliability. The distribution's parameters are estimated using the maximum likelihood method. Two real-world lifetime count data sets from the cancer disease, both of which are right-skewed and over-dispersed, are fitted using the proposed BDETE distribution to evaluate its efficacy and viability. We expect the findings to become standard works in probability theory and its related fields.

www.nature.com/scientificreports/
Mixture distributions can be applied in cancer disease to identify different subtypes and stages of the disease based on the expression of biomarkers. This approach can lead to better diagnosis, prognosis, and treatment of cancer patients. Prabakaran et al. 13 developed a Gaussian mixture model (GMM)-based classifier to improve molecular stratification of patients with breast cancer. Gaussian mixture models are often used for clustering and classification tasks in epidemiology. Their application in genotyping and disease subtyping has been explored in numerous studies, as highlighted by McLachlan et al. 14. Held et al. 15 applied the Beta distribution to infectious disease data analysis. Noor et al. 16 used a novel four-component mixture model under Bayesian estimation to estimate the average number of incidences and deaths of both genders in different age groups, considering 28 different kinds of cancer diagnosed in recent years. In this paper, the proposed mixture distribution is fitted to two datasets of cancer disease, and the results show that it is well suited to model these datasets. In other words, we devote this paper to modeling a cancer disease using a new mixture distribution called the binomial-discrete Erlang-truncated exponential distribution (BDETE). This mixture distribution combines the binomial distribution with the discrete Erlang-truncated exponential distribution. We use the probability-generating function of mixtures to find the pmf of the BDETE distribution. We then examine some statistical properties of the proposed distribution and use maximum likelihood estimation (MLE) to estimate its parameters.
The proposed BDETE distribution with three parameters is interesting because it has an increasing hazard rate function and a decreasing probability mass function. The novel lifetime mixture distribution is useful because it can model a real lifetime count data set of cancer disease that is skewed to the right and over-dispersed.
Binomial and discrete Erlang-truncated exponential distributions. The probability mass function (pmf) and the associated probability-generating function (pgf) of a binomial random variable X with parameters n and p are given as
$$f_X(x; n, p) = \binom{n}{x} p^{x} (1-p)^{n-x}, \quad x = 0, 1, \ldots, n,$$
$$P_D(d; n, p) = (1 - p + pd)^{n}.$$
The pmf of a discrete Erlang-truncated exponential (DETE) random variable N with parameters β and ω is given in ref. 17, where n is the number of failures before the first success. The DETE distribution's mean and variance follow from this pmf.
Mixing binomial and other distributions with a probability-generating function method. If we assume that the parameter n in the binomial distribution in Eq. (1) is a random variable with pmf f_N(n; ω, β), then we can use the probability generating function approach to get the binomial mixed distribution as 12
$$P_D(d; p, \omega, \beta) = \sum_{n=0}^{\infty} P_D(d; n, p)\, f_N(n; \omega, \beta),$$
where P_D(d; n, p) is the pgf of the binomial distribution, while p, ω and β are the parameters of the mixture distribution.
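The mixing step can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the DETE pmf has the form (1 − ω^β) ω^(βn) for n = 0, 1, 2, ..., which is consistent with the statement that β = 1 and ω = 1 − θ gives a geometric mixing distribution but should be checked against ref. 17. The function names and the truncation point n_max are illustrative choices.

```python
import math

def binom_pmf(x, n, p):
    """Binomial pmf: probability of x successes in n trials."""
    if x < 0 or x > n:
        return 0.0
    return math.comb(n, x) * p**x * (1.0 - p)**(n - x)

def dete_pmf(n, omega, beta):
    """ASSUMED DETE pmf: (1 - omega**beta) * omega**(beta * n), n = 0, 1, 2, ..."""
    q = omega**beta
    return (1.0 - q) * q**n

def mixture_pmf(x, p, omega, beta, n_max=500):
    """P(X = x) when X | N = n is binomial(n, p) and N has the assumed DETE pmf.

    The infinite sum over n is truncated at n_max (an illustrative cut-off).
    """
    return sum(binom_pmf(x, n, p) * dete_pmf(n, omega, beta)
               for n in range(x, n_max + 1))

def mixture_pgf(d, p, omega, beta, n_max=500):
    """Mixture pgf: sum over n of the binomial pgf (1 - p + p*d)**n times f_N(n)."""
    return sum((1.0 - p + p * d)**n * dete_pmf(n, omega, beta)
               for n in range(n_max + 1))
```

Summing d^x times mixture_pmf(x, ...) over x should reproduce mixture_pgf(d, ...), which is a quick way to confirm that the pmf and pgf routes agree for a given parameter choice.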
This paper's remaining sections are organized as follows: The proposed distribution BDETE is presented in "Binomial-discrete Erlang-truncated exponential distribution" section, and "Distributional properties" section demonstrates its statistical features, including the quantile function, the moment-generating function, the Shannon entropy, the order statistics, and the stress-strength parameter. The maximum likelihood technique is described in "Maximum likelihood estimation" section for estimating BDETE mixing parameters. In "Application" section, two real data sets are used to illustrate the performance of the BDETE distribution. Finally, some final thoughts are offered in "Conclusion remarks" section.

Binomial-discrete Erlang-truncated exponential distribution
This section presents the mathematical formulae for the pmf and cdf of the proposed binomial-discrete Erlang-truncated exponential mixture (BDETE). We also derive the hazard and survival functions for the BDETE distribution.
Probability mass and cumulative distribution functions for the BDETE. If we assume that the parameter n in the binomial distribution in Eq. (1) follows the discrete Erlang-truncated exponential distribution in Eq. (4), we can use the probability generating function method in Eq. (2) to obtain the pmf of the proposed BDETE distribution, Eq. (5); the corresponding cdf is Eq. (6). The binomial-geometric distribution 12 can be obtained from Eq. (5) by taking β = 1 and ω = 1 − θ. The pmf of the BDETE distribution for varying values of the distribution's parameters is shown in Figs. 1, 2, and 3, while the cdf is presented in Figs. 4, 5 and 6.
The proposed BDETE distribution is right-skewed, and its pmf is a decreasing function, as shown in Figs. 1 through 3.
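As a worked illustration of the pgf method, the derivation below assumes that the DETE pmf is f_N(n; ω, β) = (1 − ω^β) ω^(βn), n = 0, 1, 2, ... (an assumption that is consistent with the β = 1, ω = 1 − θ special case but should be checked against ref. 17). The resulting expressions are a sketch under that assumption and may differ in form from the paper's Eqs. (5) and (6).

```latex
\begin{aligned}
P_D(d; p, \omega, \beta)
  &= \sum_{n=0}^{\infty} (1 - p + pd)^{n} \,(1 - \omega^{\beta})\, \omega^{\beta n}
   = \frac{1 - \omega^{\beta}}{1 - \omega^{\beta}(1 - p + pd)} , \\[4pt]
f_X(x; p, \omega, \beta)
  &= \frac{(1 - \omega^{\beta})\,(p\,\omega^{\beta})^{x}}
          {(1 - \omega^{\beta} + p\,\omega^{\beta})^{x+1}},
     \qquad x = 0, 1, 2, \ldots , \\[4pt]
F_X(x; p, \omega, \beta)
  &= 1 - \left( \frac{p\,\omega^{\beta}}{1 - \omega^{\beta} + p\,\omega^{\beta}} \right)^{x+1} , \\[4pt]
\beta = 1,\ \omega = 1 - \theta:\quad
f_X(x; p, \theta)
  &= \frac{\theta\,\bigl(p(1-\theta)\bigr)^{x}}{\bigl(\theta + p(1-\theta)\bigr)^{x+1}} .
\end{aligned}
```

The second line is the coefficient of d^x when the closed-form pgf in the first line is expanded as a geometric series in d; the pmf is positive and decreasing in x, which agrees with the shape described for Figs. 1 through 3.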
Survival and hazard rate functions. The survival function of X is obtained from the cdf in Eq. (6), and the hazard function follows from the pmf in Eq. (5) together with the survival function.
[Figure: curves for p = 0.25, p = 0.5 and p = 0.9]
The hazard function of the BDETE is shown in Table 1 and Fig. 7 for a given set of p, ω and β values. Based on Table 1 and Fig. 7, we observe that the hazard function decreases as both p and ω increase. On the other hand, as β increases, the hazard function increases.
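The parameter effects described above can be explored with a short sketch. It assumes the DETE pmf is (1 − ω^β) ω^(βn), under which the binomial mixture has pmf (1 − r) r^x with r = pω^β / (1 − ω^β + pω^β), and it assumes the discrete hazard convention h(x) = P(X = x) / P(X ≥ x); the paper's own definitions may differ, and all function names are illustrative.

```python
def tail_ratio(p, omega, beta):
    """r = p*q / (1 - q + p*q) with q = omega**beta (from the ASSUMED DETE pmf)."""
    q = omega**beta
    return p * q / (1.0 - q + p * q)

def bdete_pmf(x, p, omega, beta):
    """Mixture pmf under the assumed DETE form: (1 - r) * r**x."""
    r = tail_ratio(p, omega, beta)
    return (1.0 - r) * r**x

def bdete_survival(x, p, omega, beta):
    """P(X > x) = r**(x + 1) under the same assumption."""
    return tail_ratio(p, omega, beta) ** (x + 1)

def bdete_hazard(x, p, omega, beta):
    """ASSUMED hazard convention: P(X = x) / P(X >= x), for moderate x."""
    f = bdete_pmf(x, p, omega, beta)
    return f / (f + bdete_survival(x, p, omega, beta))

# Compare the hazard at x = 3 across the p values used in the figure legend.
hazards_by_p = {p: bdete_hazard(3, p, 0.5, 1.0) for p in (0.25, 0.5, 0.9)}
```

With these assumptions the hazard at a fixed x falls as p or ω increases and rises as β increases, matching the direction of the effects reported for Table 1 and Fig. 7.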

Distributional properties
In this section, we develop some statistical properties of the BDETE distribution, such as the quantile function, the moment generating function, and some other related measures. We also define some other techniques, like the Shannon entropy and the order statistics.
Quantile function. By inverting the cdf in Eq. (6), the quantile of order 0 < r < 1 can be derived, which gives the rth quantile in Eq. (8). The BDETE distribution's median can be computed by substituting r = 1/2 in Eq. (8).
The moment-generating function. The moment-generating function of a random variable X with a BDETE distribution and parameters (p, ω, β) is given in Eq. (9). The mean (first moment) and the 2nd moment about the origin of the BDETE distribution can be calculated using Eq. (9), and the variance follows from them. It is evident from Eqs. (10) and (11) that Var(X) > E(X). This demonstrates that the BDETE distribution is always over-dispersed (the variance is larger than the mean), making it appropriate for such data.
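The quantile and the over-dispersion property can be checked numerically without closed-form expressions. The sketch below assumes the DETE pmf is (1 − ω^β) ω^(βn), so that the mixture pmf is (1 − r) r^x with r = pω^β / (1 − ω^β + pω^β); the quantile is taken as the smallest x whose cdf reaches the requested order. The names and truncation limits are illustrative.

```python
def tail_ratio(p, omega, beta):
    """r = p*q / (1 - q + p*q) with q = omega**beta (from the ASSUMED DETE pmf)."""
    q = omega**beta
    return p * q / (1.0 - q + p * q)

def bdete_pmf(x, p, omega, beta):
    """Mixture pmf under the assumed DETE form: (1 - r) * r**x."""
    r = tail_ratio(p, omega, beta)
    return (1.0 - r) * r**x

def bdete_quantile(u, p, omega, beta, x_max=10**6):
    """Smallest integer x with F(x) >= u, found by accumulating the pmf."""
    if not 0.0 < u < 1.0:
        raise ValueError("the quantile order must lie strictly between 0 and 1")
    acc = 0.0
    for x in range(x_max + 1):
        acc += bdete_pmf(x, p, omega, beta)
        if acc >= u:
            return x
    raise RuntimeError("quantile not reached before x_max")

def bdete_mean_var(p, omega, beta, x_max=20000):
    """Mean and variance from truncated sums of x*f(x) and x**2*f(x)."""
    m1 = sum(x * bdete_pmf(x, p, omega, beta) for x in range(x_max + 1))
    m2 = sum(x * x * bdete_pmf(x, p, omega, beta) for x in range(x_max + 1))
    return m1, m2 - m1**2

mean, variance = bdete_mean_var(0.5, 0.8, 2.0)
median = bdete_quantile(0.5, 0.5, 0.8, 2.0)
```

Comparing `variance` with `mean` for several parameter triples is a direct way to see the over-dispersion claim at work.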
The 3rd and 4th moments about the origin are obtained from Eq. (9) in the same way. From these moments, the coefficient of variation (C.V.), the coefficient of skewness (√β1), the coefficient of kurtosis (β2), and the index of dispersion (γ) of the BDETE distribution can be computed. Table 2 shows the mean, variance, and skewness of the BDETE distribution for various combinations of p, ω and β.
The results in Table 2 show that when both p and ω increase, so do the proposed distribution's mean and variance. Conversely, when β rises, the mean and variance fall. On the other hand, when both p and ω increase, the coefficient of skewness decreases, while when β rises, so does the coefficient of skewness. Table 2 also demonstrates that the proposed BDETE distribution has over-dispersion and positive skewness.
Shannon entropy. The Shannon entropy is one of many entropy and information indices that have been developed and used in a wide range of fields. This measure is defined as
$$H(X) = -\sum_{x} f(x) \log f(x).$$
The Shannon entropy of a random variable X with the BDETE pmf of Eq. (5) is obtained by substituting that pmf into this definition.
Order statistics. In non-parametric statistics and inference, order statistics are among the most fundamental tools, and they are used in a variety of estimation and hypothesis testing problems. Therefore, the purpose of this subsection is to develop some order statistics for the BDETE distribution, including the minimum, maximum, and median order statistics. Suppose f_k(x; p, ω, β) and F_k(x; p, ω, β) are the pmf and cdf of the kth order statistic of a random sample X_1, X_2, ..., X_n of size n taken from the BDETE distribution.

The pmf and cdf of the kth order statistic are expressed in terms of the pmf in Eq. (5) and the cdf in Eq. (6). Let X_(1) = min(X_1, X_2, ..., X_n), X_(n) = max(X_1, X_2, ..., X_n), and X_(m+1) with m = n/2 be the minimum, maximum and median order statistics, respectively. The pmfs of the minimum, maximum, and median then follow as the special cases k = 1, k = n, and k = m + 1.
Estimation of stress-strength for BDETE distribution. In this part, we look at how to estimate the stress-strength parameter when both the strength and the stress are random variables with the BDETE distribution.
The discrete version of the stress-strength parameter is defined in terms of f_X(x) and F_Y(x), the pmf of X and the cdf of Y, where X and Y are independent discrete random variables. Suppose X and Y are two independent random variables distributed as BDETE(p_1, ω_1, β_1) and BDETE(p_2, ω_2, β_2), respectively. The stress-strength parameter for the BDETE distribution is then obtained by substituting the BDETE pmf and cdf into this definition.
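The entropy, order-statistic, and stress-strength quantities of this section can all be evaluated numerically from the pmf and cdf. The sketch below rests on several assumptions that are not taken from the paper: the DETE pmf is (1 − ω^β) ω^(βn), so the mixture pmf is (1 − r) r^x with r = pω^β / (1 − ω^β + pω^β); the natural logarithm is used for the entropy; the order-statistic cdf is the standard binomial-sum formula; and the stress-strength parameter is taken as R = Σ f_X(x) F_Y(x) = P(Y ≤ X). Function names are illustrative.

```python
import math

def tail_ratio(p, omega, beta):
    """r = p*q / (1 - q + p*q) with q = omega**beta (from the ASSUMED DETE pmf)."""
    q = omega**beta
    return p * q / (1.0 - q + p * q)

def bdete_pmf(x, p, omega, beta):
    """Mixture pmf under the assumed DETE form: (1 - r) * r**x."""
    r = tail_ratio(p, omega, beta)
    return (1.0 - r) * r**x

def bdete_cdf(x, p, omega, beta):
    """F(x) = 1 - r**(x + 1) for x >= 0, and 0 for x < 0."""
    if x < 0:
        return 0.0
    return 1.0 - tail_ratio(p, omega, beta) ** (x + 1)

def shannon_entropy(p, omega, beta, x_max=20000):
    """H = -sum f(x) * ln f(x), truncated at x_max; zero-probability terms are skipped."""
    total = 0.0
    for x in range(x_max + 1):
        f = bdete_pmf(x, p, omega, beta)
        if f > 0.0:
            total -= f * math.log(f)
    return total

def order_stat_cdf(x, k, n, p, omega, beta):
    """P(X_(k) <= x): at least k of the n observations are <= x."""
    F = bdete_cdf(x, p, omega, beta)
    return sum(math.comb(n, j) * F**j * (1.0 - F)**(n - j)
               for j in range(k, n + 1))

def order_stat_pmf(x, k, n, p, omega, beta):
    """P(X_(k) = x) as a difference of the order-statistic cdf."""
    return (order_stat_cdf(x, k, n, p, omega, beta)
            - order_stat_cdf(x - 1, k, n, p, omega, beta))

def stress_strength(params_x, params_y, x_max=20000):
    """ASSUMED definition R = sum_x f_X(x) * F_Y(x) = P(Y <= X)."""
    return sum(bdete_pmf(x, *params_x) * bdete_cdf(x, *params_y)
               for x in range(x_max + 1))
```

Setting k = 1 or k = n in order_stat_cdf gives the minimum and maximum, which reduce to 1 − (1 − F)^n and F^n respectively; if the paper defines the stress-strength parameter with a strict inequality, bdete_cdf(x − 1, ...) would replace bdete_cdf(x, ...) in stress_strength.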

Maximum likelihood estimation
The goal of this section is to find the maximum likelihood estimates (MLEs) of the BDETE distribution parameters. Let X_1, X_2, ..., X_n be a random sample of size n from the BDETE distribution. The log-likelihood is given in Eq. (12). Differentiating the log-likelihood in Eq. (12) partially with respect to p, ω and β gives the likelihood equations (13), (14), and (15). The solutions of these likelihood equations provide the MLEs of p, ω and β. Since the MLE of the vector of unknown parameters τ = (p, ω, β)^T cannot be derived in closed form, the estimates must be obtained by numerical methods.
The second partial derivatives of the log-likelihood are needed for the information matrix. Following Lawless 18, the MLE of τ is asymptotically normally distributed, with a covariance matrix determined by I^{-1}(τ), the inverse of Fisher's information matrix of the unknown parameters τ = (p, ω, β)^T.
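A numerical search for the MLE can be sketched as follows. This is a coarse grid search for illustration only, not the authors' procedure; a real analysis would use a proper optimizer. It assumes the DETE pmf is (1 − ω^β) ω^(βn), so that the mixture pmf is (1 − r) r^x with r = pω^β / (1 − ω^β + pω^β). Under that assumed form the likelihood depends on (p, ω, β) only through r, so several parameter triples can attain the same likelihood value; if the paper's Eq. (5) has a different form, this remark does not apply. The sample below is made up for the example and is not taken from Tables 3 or 4.

```python
import itertools
import math

def tail_ratio(p, omega, beta):
    """r = p*q / (1 - q + p*q) with q = omega**beta (from the ASSUMED DETE pmf)."""
    q = omega**beta
    return p * q / (1.0 - q + p * q)

def neg_log_likelihood(sample, p, omega, beta):
    """-log L for the assumed pmf (1 - r) * r**x over all observations."""
    r = tail_ratio(p, omega, beta)
    return -sum(math.log(1.0 - r) + x * math.log(r) for x in sample)

def grid_search_mle(sample, p_grid, omega_grid, beta_grid):
    """Return (best -log L, p, omega, beta) over the Cartesian product of the grids."""
    best = None
    for p, omega, beta in itertools.product(p_grid, omega_grid, beta_grid):
        nll = neg_log_likelihood(sample, p, omega, beta)
        if best is None or nll < best[0]:
            best = (nll, p, omega, beta)
    return best

# Hypothetical count data for illustration only (not from the paper).
toy_sample = [0, 1, 1, 2, 3, 5, 8, 13]
unit_grid = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
best_nll, p_hat, omega_hat, beta_hat = grid_search_mle(
    toy_sample, unit_grid, unit_grid, [0.5, 1.0, 2.0])
```

The returned best_nll is the quantity that would be compared across competing models, in the same way as the −log(L) values reported in the application.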

Application
In this section, we examine two data sets to illustrate the use of the proposed BDETE distribution. To evaluate its performance and check its goodness of fit, the BDETE distribution is compared with several related distributions: the binomial geometric (BG) 12, the negative binomial-discrete Erlang-truncated exponential (NBDETE) 19, the discrete Erlang-truncated exponential (DETE) 17, the discrete extended Erlang-truncated exponential (DEETE) 20, and the discrete Kumaraswamy Erlang-truncated exponential (DKw_ETE) 21. Both the chi-square statistic and the negative log-likelihood (−log(L)) are used as evaluation tools. Two right-skewed, over-dispersed real lifetime count data sets from the cancer disease are fitted with the BDETE distribution. The first data set, in Table 3, provided by Klein and Moeschberger 22, describes the death times, expressed in weeks, of 30 tongue cancer patients. This data set was used by Eledum and El-Alosey 12 to study the binomial geometric distribution. Its mean, variance, and skewness are 50.03, 1945.84, and 0.972, respectively. The second data set, in Table 4, released by Lawless 18, gives the lengths of remission in weeks for a group of 30 leukemia patients taking a specific kind of medicine. This data set was also used by Eledum and El-Alosey 12 to assess the binomial geometric distribution. The results for the two data sets are presented in Tables 5 and 6.
From the results in Table 5, we can see that the proposed BDETE distribution has the smallest value of −log(L) (157.487) among the compared distributions (the smaller, the better). This value, along with the value of the χ2 statistic (23.12) and its associated p-value (0.5960), indicates that the BDETE distribution is the best of the compared models for the tongue cancer patients' data set, although all the studied distributions fit this data set well. Table 6 shows that, among the compared distributions, the proposed BDETE distribution also has the smallest value of −log(L) (127.24). This result, combined with the χ2 statistic value (33.01) and the corresponding p-value (0.1038), indicates that the proposed BDETE distribution is the most appropriate model for the leukemia patients' data set. Again, all distributions that were considered fit the data well.
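The χ2 statistic used in this comparison can be sketched as a Pearson statistic over binned counts. The sketch assumes the mixture pmf (1 − r) r^x with r = pω^β / (1 − ω^β + pω^β), which follows from assuming the DETE pmf is (1 − ω^β) ω^(βn); the binning scheme, function names, and the tiny sample are hypothetical and are not the paper's data or its binning. The p-value is not computed here because it needs the chi-square distribution function.

```python
def tail_ratio(p, omega, beta):
    """r = p*q / (1 - q + p*q) with q = omega**beta (from the ASSUMED DETE pmf)."""
    q = omega**beta
    return p * q / (1.0 - q + p * q)

def bdete_cdf(x, p, omega, beta):
    """F(x) = 1 - r**(x + 1) for x >= 0, and 0 for x < 0."""
    if x < 0:
        return 0.0
    return 1.0 - tail_ratio(p, omega, beta) ** (x + 1)

def binned_probs(edges, p, omega, beta):
    """Probabilities of the bins [edges[i], edges[i+1]) plus the open bin [edges[-1], inf).

    edges must be increasing integers starting at 0 so that every count is covered.
    """
    below = [bdete_cdf(e - 1, p, omega, beta) for e in edges]
    probs = [below[i + 1] - below[i] for i in range(len(edges) - 1)]
    probs.append(1.0 - below[-1])
    return probs

def chi_square_stat(sample, edges, p, omega, beta):
    """Pearson statistic: sum of (observed - expected)**2 / expected over the bins."""
    probs = binned_probs(edges, p, omega, beta)
    observed = [0] * len(probs)
    for x in sample:
        # Index of the last edge that does not exceed x.
        idx = max(i for i, e in enumerate(edges) if x >= e)
        observed[idx] += 1
    expected = [len(sample) * pr for pr in probs]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for illustration only (not the paper's data).
stat = chi_square_stat([0, 0, 1, 3], [0, 1, 2], 0.5, 0.5, 1.0)
```

In practice the bins would be chosen so that every expected count is reasonably large, and the statistic would be referred to a chi-square distribution with degrees of freedom reduced by the number of estimated parameters.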

Conclusion remarks
This paper developed a novel binomial mixture distribution called the binomial-discrete Erlang-truncated exponential distribution (BDETE), which was created by combining the binomial with the discrete Erlang-truncated exponential distribution using the probability generating function method. We examined some of the statistical properties of the BDETE distribution and used the maximum likelihood method to estimate its parameters. The new compound distribution has an increasing hazard rate function, depending on the behavior of the distribution's parameters. Two real-world lifetime count data sets from the cancer disease, both of which are right-skewed and over-dispersed, were fitted using the proposed BDETE distribution to evaluate its efficacy and viability. The application showed that, among the compared models, the proposed distribution is the most appropriate for real lifetime count data on cancer diseases that are right-skewed, over-dispersed, and have a decreasing probability mass function. Based on its increasing failure rate and decreasing probability mass function, we recommend the proposed BDETE distribution for modeling lifetime count data from the medical field, especially in cancer diseases. In future studies, a further mixing of the BDETE distribution could be considered to increase its flexibility.

Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Received: 19 April 2023; Accepted: 13 July 2023
Table 6. Parameter estimates, −log(L), k-s test value and p-value for the selected distributions of the leukemia patients' data set.